
IBM System x GPFS Storage Server

Crispin Keable
Technical Computing Architect

© 2012 IBM Corporation


IBM Technical Computing: a comprehensive portfolio that uniquely addresses supercomputing and mainstream client needs

• Power Systems™ – engine for faster insights
• PureSystems™ – integrated expertise for improved economics
• System x® – redefining x86
• Blue Gene® – extremely fast, energy-efficient supercomputer
• System Storage® – smarter storage
• iDataPlex® – fast, dense, flexible
• GPFS and GPFS Storage Server – big data storage
• IBM Platform LSF® Family, IBM Platform HPC, IBM Platform Symphony Family, IBM Platform Cluster Manager – HPC Cloud and Technical Computing for Big Data solutions
• Intelligent Cluster – factory-integrated, interoperability-tested system with compute, storage, networking and cluster management


"Perfect Storm" of Synergetic Innovations

GPFS Native RAID Storage Server: big data converging with HPC technology, and server converging with storage.

• Disruptive integrated storage software – declustered RAID with GPFS reduces overhead and speeds rebuilds by roughly 4-6x
• Performance – POWER and x86 cores are more powerful than special-purpose controller chips
• High-speed interconnect – clustering and storage traffic, including failover (PERCS/Power fabric, InfiniBand, or 10GbE)
• Data integrity, reliability and flexibility – end-to-end checksums, 2- and 3-fault tolerance, application-optimized RAID
• Integrated hardware/packaging – server and storage co-packaging improves density and efficiency
• Cost/performance – a software-based controller reduces hardware overhead and cost, and enables enhanced functionality


IBM GPFS Native RAID p775: High-Density Storage + Compute Server (high-end POWER)

One rack performs a 1 TB Hadoop TeraSort in less than 3 minutes!

• Based on the Power 775 / PERCS solution
• Basic configuration: 32 POWER7 32-core high-bandwidth servers, configurable as GPFS Native RAID storage controllers, compute servers, I/O servers or spares
• Up to 5 disk enclosures per rack, each with 384 drives and 64 quad-lane SAS ports
• Capacity: 1.1 PB per rack (900 GB SAS HDDs)
• Bandwidth: >150 GB/s read bandwidth per rack
• Compute power: 18 TF plus node sparing
• Interconnect: IBM high-bandwidth optical PERCS
• Multi-rack scalable, fully water-cooled


How does GNR work?

Traditional layout: clients reach file/data servers (e.g. two x3650 NSD file servers) over FDR InfiniBand or 10 GbE; the NSD servers sit in front of custom dedicated disk controllers, which in turn drive JBOD disk enclosures.

GNR layout: migrate RAID and disk management onto commodity file servers. The same clients reach two NSD file servers running GPFS Native RAID, which drive the JBOD disk enclosures directly and eliminate the dedicated controllers.


A Scalable Building Block Approach to Storage

Building block: x3650 M4 servers attached to "twin-tailed" JBOD disk enclosures. The complete storage solution combines data servers, disk (NL-SAS and SSD), software, InfiniBand and Ethernet.

• Model 24 – light and fast: 4 enclosures, 20U; 232 NL-SAS + 6 SSD; 10 GB/s
• Model 26 – HPC workhorse: 6 enclosures, 28U; 348 NL-SAS + 6 SSD; 12 GB/s
• High-density HPC option: 18 enclosures in two 42U standard racks; 1044 NL-SAS + 18 SSD; 36 GB/s

Performance figures are based on the IOR benchmark.


GPFS Native RAID Feature Detail

• Declustered RAID (a placement sketch follows this list)
  – Data and parity stripes are uniformly partitioned and distributed across a disk array
  – Arbitrary number of disks per array (not constrained to an integral number of RAID stripe widths)
• 2-fault and 3-fault tolerance
  – Reed-Solomon parity encoding
  – 2- or 3-fault-tolerant stripes: 8 data strips + 2 or 3 parity strips
  – 3- or 4-way mirroring
• End-to-end checksum and dropped-write detection
  – From the disk surface to the GPFS user/client
  – Detects and corrects off-track and lost/dropped disk writes
• Asynchronous error diagnosis while affected I/Os continue
  – If media error: verify and restore if possible
  – If path problem: attempt alternate paths
• Supports live replacement of disks
  – I/O operations continue for tracks whose disks have been removed during carrier service
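To make the declustering bullet above concrete, here is a minimal, hypothetical placement sketch in Python: the strips of each 8+2 stripe are rotated across an arbitrary number of disks, so no stripe is pinned to a fixed narrow array. The rotating-offset rule and all counts are illustrative assumptions, not the layout GPFS Native RAID actually computes.

    # Minimal declustering illustration: strips of each 8+2 stripe are spread
    # across ALL disks instead of being confined to a fixed 10-disk array.
    # The rotating-offset placement rule below is an assumption for
    # illustration only.

    NUM_DISKS = 14          # arbitrary array size, not a multiple of the stripe width
    DATA_STRIPS = 8
    PARITY_STRIPS = 2       # 2-fault tolerant (8 + 2); use 3 for 8 + 3
    STRIPE_WIDTH = DATA_STRIPS + PARITY_STRIPS

    def place_stripe(stripe_no, num_disks=NUM_DISKS, width=STRIPE_WIDTH):
        """Return the disks holding the strips of one stripe.

        Each stripe starts at a different disk, so successive stripes rotate
        around the whole array and every disk ends up holding a mix of data
        and parity strips.
        """
        start = (stripe_no * width) % num_disks
        return [(start + i) % num_disks for i in range(width)]

    if __name__ == "__main__":
        per_disk = [0] * NUM_DISKS
        for s in range(70):                      # 70 stripes, purely illustrative
            for disk in place_stripe(s):
                per_disk[disk] += 1
        print("strips per disk:", per_disk)      # uniform: 50 strips on each disk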


Declustering – bringing parallel performance to disk maintenance

• Conventional RAID: narrow data+parity arrays
  – Example: 20 disks arranged as 4 conventional RAID arrays of 5 disks each, holding 4x4 RAID stripes (data plus parity)
  – After a disk failure, the rebuild can only use the I/O capacity of the 4 surviving disks in that array; because files are striped across all arrays, every file access is throttled by the rebuild overhead on the affected array
• Declustered RAID: data+parity distributed over all disks
  – Example: the same 20 disks form one declustered RAID array holding 16 RAID stripes (data plus parity)
  – After a disk failure, the rebuild can use the I/O capacity of all 19 surviving disks, so the load on file accesses during the rebuild is reduced by about 4.8x (= 19/4); a small calculation sketch follows
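A back-of-the-envelope check of the 4.8x figure above, using the slide's own numbers (20 disks, one failure, rebuild work shared evenly among the disks that can serve rebuild I/O):

    # Rebuild-load comparison for the 20-disk example above.  Assumes rebuild
    # work is shared evenly over the disks able to participate; real
    # scheduling is more nuanced.

    DISKS = 20
    ARRAY_SIZE = 5                           # conventional: 4 arrays of 5 disks

    conventional_sources = ARRAY_SIZE - 1    # only the affected array's survivors
    declustered_sources = DISKS - 1          # every surviving disk holds relevant strips

    speedup = declustered_sources / conventional_sources
    print(f"rebuild I/O sources: {conventional_sources} vs {declustered_sources}")
    print(f"per-disk rebuild load reduced by about {speedup:.1f}x")   # ~4.8x (19/4)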


Declustered RAID Example

Conventional layout: 3 one-fault-tolerant mirrored groups (RAID1), each holding 7 stripes of 2 strips (21 stripes, 42 strips in total), occupy 3 pairs of disks (6 disks) plus 1 spare disk.

Declustered layout: the same 21 stripes plus 7 spare strips – 49 strips in all – are spread uniformly across all 7 disks, so every disk holds a mix of data, mirror and spare strips. (A counting sketch follows.)
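The strip bookkeeping in this example can be checked in a few lines of Python; the group, stripe and spare counts come straight from the slide, and the even split across disks is the declustering property being illustrated:

    # Strip accounting for the declustered RAID1 example above.

    groups            = 3      # 1-fault-tolerant mirrored (RAID1) groups
    stripes_per_group = 7
    strips_per_stripe = 2      # data strip + mirror strip
    spare_strips      = 7      # one spare disk's worth of capacity
    disks             = 7

    data_and_mirror = groups * stripes_per_group * strips_per_stripe   # 42 strips
    total_strips    = data_and_mirror + spare_strips                   # 49 strips

    assert total_strips % disks == 0
    print(f"{data_and_mirror} data/mirror strips + {spare_strips} spare strips "
          f"= {total_strips} strips, i.e. {total_strips // disks} strips per disk "
          f"when declustered over {disks} disks")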


Rebuild Overhead Reduction Example

• Conventional RAID: after a disk fails, rebuild read/write activity is confined to just a few disks – the rebuild is slow and disrupts user programs.
• Declustered RAID: rebuild read/write activity is spread across many disks, so each disk does less work and user programs see less disruption. In this example the rebuild overhead is reduced by 3.5x.


GPFS Native RAID Advantages

• Lower cost
  – Software RAID: no hardware storage controller, 10-30% lower cost with higher performance
  – Off-the-shelf SBODs: generic low-cost disk enclosures with standardized in-band SES management
  – Standard Linux or AIX on generic high-volume servers
  – Delivered as a component of GPFS
• Industry-leading performance
  – Aligned full-stripe writes are disk-limited; small writes are limited by the backup node's NVRAM log write
  – Fastest rebuild times, using declustered RAID
  – Faster than alternatives today – and tomorrow!
• Extreme data integrity
  – 2- and 3-fault-tolerant erasure codes with 80% and 73% storage efficiency respectively (see the arithmetic below)
  – End-to-end checksum and protection against lost writes
• Reduced application load during rebuilds
  – Declustered RAID gives up to 3x lower rebuild overhead to applications
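The 80% and 73% storage-efficiency figures quoted above follow directly from the 8+2 and 8+3 layouts; a quick check in Python:

    # Storage efficiency = data strips / total strips per stripe.

    def efficiency(data_strips, redundancy_strips):
        return data_strips / (data_strips + redundancy_strips)

    print(f"8 + 2 (2-fault tolerant): {efficiency(8, 2):.0%}")   # 80%
    print(f"8 + 3 (3-fault tolerant): {efficiency(8, 3):.0%}")   # 73%
    print(f"3-way mirroring:          {efficiency(1, 2):.0%}")   # 33%
    print(f"4-way mirroring:          {efficiency(1, 3):.0%}")   # 25%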




Introducing IBM System x GPFS Storage Server: Bringing HPC Technology to the Mainstream

• Better, sustained performance
  – Industry-leading throughput using efficient declustered RAID techniques
• Better value
  – Leverages System x servers and commercial JBODs
• Better data security
  – From the disk platter to the client
  – Enhanced RAID protection technology
• Affordably scalable
  – Start small and affordably, then scale via incremental additions
  – Add capacity AND bandwidth
• 3-year warranty
  – Manage and budget costs
• IT-facility friendly
  – Industry-standard 42U, 19-inch rack mounts; no special height requirements; client racks are OK!
• And all the data management / lifecycle capabilities of GPFS – built in!


Declustered RAID6 Example

Left: 14 physical disks arranged as 3 traditional RAID6 arrays plus 2 spares. Right: the same 14 physical disks as one declustered RAID6 array in which data, parity and spare strips are all spread across every disk.

When two disks fail in the traditional layout, both failures can land in the same array, and every stripe in that array then carries 2 faults:

  Faults per stripe (Red / Green / Blue): 0/2/0 for all 7 stripes
  Number of stripes with 2 faults = 7

With the declustered layout, the strips of the two failed disks are spread over many different stripes, so almost all stripes see at most 1 fault:

  Faults per stripe (Red / Green / Blue): 1/0/1, 0/0/1, 0/1/1, 2/0/0, 0/1/1, 1/0/1, 0/1/0
  Number of stripes with 2 faults = 1

Only the stripes with 2 faults have lost all their margin and must be rebuilt urgently; declustering leaves far fewer of them. (A simulation sketch follows.)
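The comparison above can be reproduced with a small Monte-Carlo sketch. The random strip placement is only a crude stand-in for GNR's deterministic declustered layout, and the counts (14 disks, three 4-disk RAID6 arrays of 7 stripes, 2 spares) follow the slide; the declustered result will vary slightly from run to run.

    import random

    DISKS   = 14
    STRIPES = 21            # 3 arrays x 7 stripes in the traditional layout
    WIDTH   = 4             # small RAID6 stripe: 2 data + 2 parity strips

    def stripes_with_two_faults(layout, failed):
        """Count stripes that lost 2 strips to the failed disks."""
        return sum(1 for stripe in layout if len(set(stripe) & failed) == 2)

    # Traditional: stripe i lives entirely inside array i // 7 (disks 4k .. 4k+3).
    traditional = [[4 * (i // 7) + j for j in range(WIDTH)] for i in range(STRIPES)]

    # Declustered: each stripe's 4 strips land on 4 distinct disks drawn from
    # all 14 (random here, standing in for the real declustered layout).
    declustered = [random.sample(range(DISKS), WIDTH) for _ in range(STRIPES)]

    failed = {4, 5}         # two failures inside the same traditional array
    print("traditional:", stripes_with_two_faults(traditional, failed), "critical stripes")  # 7
    print("declustered:", stripes_with_two_faults(declustered, failed), "critical stripes")  # usually 0-3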


Where GPFS Storage Server Fits

A positioning chart spanning market segments (local universities, petroleum, media/entertainment, financial services, bio/life sciences, CAE, higher-end universities, government and high-end research) against storage offerings: direct-attached (DS3000 + V3700), DCS3700 and DCS3700+, SONAS, and GPFS Storage Server at the high end.


Data Protection Designed for 200K+ Drives!

• Platter-to-client protection
  – Multi-level data protection to detect and prevent bad writes and on-disk data loss
  – Data checksums carried and sent from the platter to the client server
• Integrity management (a rebuild-scheduling sketch follows this list)
  – Rebuild: selectively rebuild portions of a disk, restoring full redundancy in priority order after disk failures
  – Rebalance: when a failed disk is replaced with a spare disk, redistribute the free space
  – Scrub: verify the checksum and consistency of data and parity/mirror, and fix problems found on disk
  – Opportunistic scheduling: run at full disk speed when there is no user activity, and at a configurable rate when the system is busy
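"Restore full redundancy, in priority order" can be pictured with a toy scheduler: stripes that have lost the most strips (and so have the least margin left) are rebuilt first. All names and the per-stripe records below are hypothetical; GNR tracks this state internally.

    import heapq

    MAX_FAULTS = 3   # e.g. an 8+3 stripe survives at most 3 lost strips

    def rebuild_order(damaged_stripes):
        """Yield (stripe id, missing strips), most critical first."""
        heap = [(-missing, sid) for sid, missing in damaged_stripes if missing > 0]
        heapq.heapify(heap)
        while heap:
            neg_missing, sid = heapq.heappop(heap)
            yield sid, -neg_missing

    # Hypothetical damage report: (stripe id, strips currently missing).
    damaged = [("s17", 1), ("s04", 3), ("s09", 2), ("s21", 1), ("s12", 2)]
    for sid, missing in rebuild_order(damaged):
        print(f"rebuild {sid}: {missing} strip(s) missing, "
              f"{MAX_FAULTS - missing} fault(s) of margin left")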


Non-Intrusive Disk Diagnostics

• Disk hospital: background determination of problems (a sketch of the read/write behaviour follows this list)
  – While a disk is in the hospital, GNR non-intrusively and immediately returns data to the client using the error-correction code
  – For writes, GNR non-intrusively marks the write data and reconstructs it later in the background, after problem determination is complete
• Advanced fault determination
  – Statistical reliability and SMART monitoring
  – Neighbor check
  – Media error detection and correction
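A schematic sketch of the behaviour described above while a disk is "in the hospital": reads are answered immediately by reconstructing from redundancy, and writes to the suspect disk are only marked and applied later, once diagnosis has finished. The class and method names are invented for illustration; the real logic is internal to GPFS.

    class DeclusteredArray:
        """Toy model of the GNR disk-hospital read/write behaviour."""

        def __init__(self):
            self.hospital = set()        # disks under background diagnosis
            self.stale_strips = set()    # (disk, strip) written while in hospital

        def read_strip(self, disk, strip):
            if disk in self.hospital:
                # Don't wait on the suspect disk: rebuild the strip from the
                # surviving data and parity strips of its stripe.
                return self._reconstruct_from_redundancy(disk, strip)
            return self._read_from_disk(disk, strip)

        def write_strip(self, disk, strip, data):
            if disk in self.hospital:
                # Mark the strip stale; it is reconstructed onto the disk (or
                # its replacement) in the background after diagnosis completes.
                self.stale_strips.add((disk, strip))
                return
            self._write_to_disk(disk, strip, data)

        # Placeholders standing in for the real I/O and erasure-code paths.
        def _reconstruct_from_redundancy(self, disk, strip):
            return b"reconstructed"

        def _read_from_disk(self, disk, strip):
            return b"raw"

        def _write_to_disk(self, disk, strip, data):
            pass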


GSS – End-to-End Checksums and Version Numbers

Each data block carries a checksum trailer.

• End-to-end checksums
  – Write operation: checksummed between the user compute node and the GNR node, then from the GNR node to disk together with a version number
  – Read operation: checksummed from disk to the GNR node (with the version number) and from the I/O node to the user compute node
• Version numbers in the metadata are used to validate checksum trailers for dropped-write detection
  – Only a validated checksum can protect against dropped writes (see the sketch below)
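A minimal sketch of why the version number matters: a checksum alone cannot distinguish a correctly-checksummed but stale block (a dropped write) from current data, while comparing the trailer's version against the version recorded in metadata can. The trailer layout and field names below are assumptions for illustration, not the on-disk GNR format.

    import zlib

    def make_block(payload: bytes, version: int) -> dict:
        """Data plus an illustrative checksum trailer (checksum + version)."""
        return {"data": payload,
                "checksum": zlib.crc32(payload),
                "version": version}

    def validate(block: dict, expected_version: int) -> bytes:
        if zlib.crc32(block["data"]) != block["checksum"]:
            raise IOError("corruption detected: checksum mismatch")
        if block["version"] != expected_version:
            # The checksum is fine, but the block is stale: a newer write
            # never reached the platter.  Only the version check catches this.
            raise IOError("dropped write detected: stale version on disk")
        return block["data"]

    # The metadata says version 7 was written, but the disk still holds version 6.
    on_disk = make_block(b"old contents", version=6)
    try:
        validate(on_disk, expected_version=7)
    except IOError as err:
        print(err)        # dropped write detected: stale version on disk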


GSS Data Integrity

• Silent data corruption is caused by disk off-track writes, dropped writes (e.g., disk firmware bugs), or undetected read errors.
• Old adage: "No data is better than bad data."
• Proper data integrity checking requires an end-to-end checksum plus dropped-write detection: without it, an application can ask to read block A and silently be handed different or stale contents B by the disk.


GNR / Mestor Future Research Directions

• GNR "ring" configuration
  – An adaptation of the building-block approach using shared (dual-ported) disks, with data managed by the storage nodes
  – Overlapping configuration: roughly half the nodes of standard controllers, with storage-node pairs attached to shared disks
  – Scales out to many storage nodes over the fabrics, with a global namespace and disk management
• Mestor
  – A non-shared-disk approach (network RAID): data is striped across storage nodes, each driving its own captive disks
  – Scales out to many storage nodes, with a global namespace and disk management


GPFS Native RAID for System x Proposed Timeline

Timeline markers (2012-2015): first customer ship, V1.0 announce at SC12, V1.5 announce at ISC13.

• V1.0 – Getting started
  – System x Intelligent Clusters "solution", ordered through the System x Intelligent Clusters process
  – Software installed and configured at the customer location by the end user or IBM Services
  – Support coordinated by Intelligent Clusters; early-access customers
• V1.5
  – Solution sold via Intelligent Clusters; bug fixes
  – Support provided via the Intelligent Clusters standard mechanism
  – Upgrade path defined for current DCS3700 customers
  – Drive roll for new drives (4 TB NL-SAS)
• V2.0 – Complete machine type/model, fully supported
  – Plug-and-play GPFS appliance with a GUI for management
  – Evaluate a smaller form factor (12-drive enclosures?)
• V2.5 – Miniaturization release
  – Support entry-level configurations based upon Mestor
  – Storage-rich servers (internal drives) with RAID across the servers
